ABSTRACT


Estimating student proficiency is an important task for com- 
puter based learning systems. We compare a family of IRT- 
based proficiency estimation methods to Deep Knowledge 
Tracing (DKT), a recently proposed recurrent neural net- 
work model with promising initial results. We evaluate how 
well each model predicts a student’s future response given 
previous responses using two publicly available and one proprietary
data set. We find that IRT-based methods consistently
matched or outperformed DKT across all data sets
at the finest level of content granularity that was tractable 
for them to be trained on. A hierarchical extension of IRT 
that captured item grouping structure performed best overall.
When data sets included non-trivial autocorrelations
in student response patterns, a temporal extension of IRT 
improved performance over standard IRT while the RNN- 
based method did not. We conclude that IRT-based models 
provide a simpler, better-performing alternative to existing 
RNN-based models of student interaction data while also 
affording more interpretability and guarantees due to their 
formulation as Bayesian probabilistic models. 
1. INTRODUCTION


A key challenge for computer-based learning systems is to 
estimate a student’s proficiency based on her previous inter- 
actions with the system. Accurate estimation of proficiency 
enables more efficient diagnosis and remediation of her weaknesses
and more effective advancement of her knowledge
frontier. Proficiency estimates can also provide the student 
or teacher with actionable information to improve student 
outcomes when reported as analytics [21]. 


Two classical families of methods for estimating proficiency 
are Item Response Theory (IRT) [8, 13] and Bayesian Knowl- 
edge Tracing (BKT) [2]. IRT essentially amounts to struc- 
tured logistic regression (see Section 2.1), estimating latent 
quantities corresponding to student ability and assessment 
properties such as difficulty. BKT does not capture assess- 
ment properties but employs a dynamic representation of 
student ability. A growing body of recent work has focused 
on modeling various structural properties of students and as- 
sessments in an attempt to combine the advantages of IRT 
and BKT (for instance [14, 15, 11, 5, 10, 12, 3]). In a recently
proposed method known as Deep Knowledge Tracing
(DKT) [16], a recurrent neural network was trained to pre- 
dict student responses and was shown to outperform the 
best published results ([15]) on the publicly available AS- 
SISTments data set [4] by about 20 percentage points with 
respect to the AUC metric described in Section 4. 


To investigate DKT’s advantage over traditional models, we 
compared a standard one parameter IRT model, two exten- 
sions of that model, and DKT on three data sets (two are 
publicly available and one is proprietary) on a realistic on- 
line prediction task that is typically required by computer- 
based learning systems (see Section 4), and which was consistent
with the evaluation task employed in [16].¹ We reproduce
the results of [16] on the ASSISTments data set,
but find that proper accounting for duplicate data negates 
the claimed performance gains. For the two larger data sets, 
computational tractability hampered our ability to train DKT 
on fine-grained content labels, while training IRT-based mod- 
els scaled to handle them. Moreover, the IRT-based models’ 
best tractable performance matches or outperforms DKT’s 
best tractable performance on all data sets, with a hierar- 
chical extension of IRT performing the best in all cases. We 
conclude that for these data sets, IRT-based models provide 
simple, better-performing alternatives to DKT while also 
affording more interpretability and guarantees due to their 
formulation as Bayesian probabilistic models. 


¹Code for the IRT and DKT models, as well as instructions
for reproducing our results, can be found at
github.com/Knewton/edm2016.
2. MODELS OF STUDENT RESPONSES


In this section we set notation and describe the models we 
compare. Throughout, we will represent the student re- 
sponse data D as a set of tuples (s,i,r,t) indicating the 
student, item, correctness, and time of each response. In 
this paper, time will be indexed by interaction index (rather 
than wall clock time). 


2.1 Item Response Theory (IRT) 


Item Response Theory (IRT) is a standard framework for 
modeling student responses dating back to the 1950s [8, 13]. 
A single number, called the proficiency or ability, represents 
a student’s knowledge state during the course of completing 
several assessments. It is assumed that this proficiency is 
not changing during this examination.²


The model assumes that many students have completed a
test of dichotomous items and assigns each student $s$ a proficiency
$\theta_s \in \mathbb{R}$. A key innovation of IRT is to model variation
across different items. In its simplest form, the one-parameter
model, each item $i$ is assigned a parameter $\beta_i$,
representing the difficulty of the item. The probability that
a student $s$ answers item $i$ correctly is given by $f(\theta_s - \beta_i)$,
where $f$ is some sigmoidal function.


When $f$ is the logistic function, this corresponds to (structured)
logistic regression, where the factors for a response to
an item are indicators for students and items. We use a variant
of this model known as 1PO (one-parameter ogive) IRT,
where the link function $f(x) = \Phi(x)$ is the cumulative distribution
function of the standard normal distribution.³ The
maximum likelihood solution of $\{\theta_s, \beta_i\}$ is underdetermined,⁴
so we take a Bayesian approach and regularize the solution
of $\{\theta_s, \beta_i\}$ by imposing independent standard normal prior
distributions over each $\theta_s$ and $\beta_i$.
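The response model above can be illustrated with a minimal sketch, assuming the ogive link $f = \Phi$ described in the text. The function names and the numeric values are hypothetical and chosen only for illustration.

```python
import math

def probit(x: float) -> float:
    """Standard normal CDF, Phi(x), written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_correct(theta: float, beta: float) -> float:
    """P(correct) under 1PO IRT: f(theta - beta) with f = Phi."""
    return probit(theta - beta)

# Hypothetical values: an average student on an easy and on a hard item.
p_easy = p_correct(theta=0.0, beta=-1.0)
p_hard = p_correct(theta=0.0, beta=1.0)

# Adding the same constant to theta and beta leaves the prediction unchanged,
# which is why the maximum likelihood solution is underdetermined.
p_shifted = p_correct(theta=0.0 + 3.0, beta=1.0 + 3.0)
```

Because only the difference $\theta_s - \beta_i$ enters the model, some form of regularization (here, the priors) is needed to pin down the parameters.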


2.1.1 Learning 

To train the parameters on student response data, we maximize
the log posterior probability of $\{\theta_s, \beta_i\}$ given the response
data (the set of response correctnesses $\{r : (s,i,r,t) \in D\}$,
each of which is 0 or 1). Assuming independent, standard
normal priors on each $\theta_s$, $\beta_i$, the log posterior is:

$$
\log P(\{\theta_s\}, \{\beta_i\} \mid D) = \sum_{(s,i,r,t) \in D} \left[ r \log f(\theta_s - \beta_i) + (1 - r) \log\left(1 - f(\theta_s - \beta_i)\right) \right] - \frac{1}{2} \sum_s \theta_s^2 - \frac{1}{2} \sum_i \beta_i^2 + C. \tag{1}
$$

We maximize this objective with respect to the parameters
using standard second-order ascent methods to obtain the
maximum a posteriori (MAP) estimate of each parameter.
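As a rough illustration of MAP estimation for this model, the sketch below minimizes the negative log posterior on a tiny, made-up data set. It uses SciPy's L-BFGS-B optimizer (a quasi-Newton method) as a stand-in; this is an assumption for the example and not necessarily the second-order method used for the reported results.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_log_posterior(params, students, items, correct, n_students):
    """Negative log posterior of 1PO IRT with standard normal priors (constant dropped)."""
    theta, beta = params[:n_students], params[n_students:]
    z = theta[students] - beta[items]
    # log Phi(z) for correct responses; log(1 - Phi(z)) = log Phi(-z) otherwise.
    log_lik = np.sum(correct * norm.logcdf(z) + (1 - correct) * norm.logcdf(-z))
    log_prior = -0.5 * np.sum(theta ** 2) - 0.5 * np.sum(beta ** 2)
    return -(log_lik + log_prior)

# Hypothetical toy data: one (student, item, correctness) triple per response.
students = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
items = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])
correct = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0])
n_students, n_items = 3, 3

x0 = np.zeros(n_students + n_items)
result = minimize(neg_log_posterior, x0,
                  args=(students, items, correct, n_students),
                  method="L-BFGS-B")
theta_map, beta_map = result.x[:n_students], result.x[n_students:]
```

In this toy example the student who answers everything correctly receives the highest proficiency estimate, and the item answered correctly most often receives the lowest difficulty estimate.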


2.2 Hierarchical IRT (HIRT) 


²For an in-depth discussion of IRT and a review of related
literature see [17], especially Chapter 5.

³The ogive yields nearly identical results to the commonly
used logistic link function, but allows closed-form posterior
computation in the temporal IRT model described in Sec. 2.3.

⁴For example, the response predictions are invariant when
adding a constant offset to the $\{\theta_s\}$'s and $\{\beta_i\}$'s.


Proceedings of the 9th International Conference on Educational Data Mining 


In many situations, including each of our data sets, the assessment
items may have structure that can inform predictions
of student responses. For example, groups of items may
assess the same topic, resulting in item properties that are
more similar within groups than across them. Alternatively,
items may be derived from common templates. Templates,
often found in math courses, look like “What is x+y?” and
a particular instantiation is generated by choosing values for
x and y. For example, the ASSISTments data set contains
several problems, many of which share the same template,
and many of which in turn assess a single skill.


We can augment the IRT model to incorporate knowledge
about item groups, resulting in a hierarchical IRT model
(HIRT). Each item $i$ is associated with a group $j(i)$, and its
difficulty is distributed normally around a per-group mean
$\mu_{j(i)}$: $\beta_i \sim N(\mu_{j(i)}, \sigma^2)$. Each $\mu_j$ is in turn distributed
according to the hyperprior $\mu_j \sim N(0, \tau^2)$. This reflects the
belief that the difficulty of items in the same group should
be similar. The degenerate cases provide some intuition:
the limit $\sigma \to 0$ is the same model as 1PO IRT where we
consider the items in the group to be the same item, and
the limit $\tau \to 0$ is equivalent to a 1PO IRT model with no
groupings.


2.2.1 Learning 
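Learning is done similarly to Bayesian IRT (Section 2.1),
except that we ascend the modified log posterior probability

$$
\log P(\{\theta_s\}, \{\beta_i\}, \{\mu_j\} \mid D) = \sum_{(s,i,r,t) \in D} \left[ r \log f(\theta_s - \beta_i) + (1 - r) \log\left(1 - f(\theta_s - \beta_i)\right) \right] - \frac{1}{2} \sum_s \theta_s^2 - \frac{1}{2\sigma^2} \sum_i \left(\beta_i - \mu_{j(i)}\right)^2 - \frac{1}{2\tau^2} \sum_j \mu_j^2 + C. \tag{2}
$$

We maximize this objective with respect to $\{\theta_s, \beta_i, \mu_j\}$.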
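To make the role of the group structure concrete, here is a minimal sketch of the hierarchical log posterior, assuming the ogive link and the priors described in this section. The array `group_of_item` (mapping each item to its group index) and the toy values are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def hirt_log_posterior(theta, beta, mu, students, items, correct,
                       group_of_item, sigma2, tau2):
    """HIRT log posterior up to an additive constant."""
    z = theta[students] - beta[items]
    log_lik = np.sum(correct * norm.logcdf(z) + (1 - correct) * norm.logcdf(-z))
    prior_theta = -0.5 * np.sum(theta ** 2)
    # Each item difficulty is pulled toward the mean of its own group j(i).
    prior_beta = -0.5 / sigma2 * np.sum((beta - mu[group_of_item]) ** 2)
    prior_mu = -0.5 / tau2 * np.sum(mu ** 2)
    return log_lik + prior_theta + prior_beta + prior_mu

# Hypothetical toy setup: items 0 and 1 share a group, item 2 is in another.
group_of_item = np.array([0, 0, 1])
students = np.array([0, 0, 0])
items = np.array([0, 1, 2])
correct = np.array([1, 1, 0])
theta = np.zeros(1)
mu = np.array([-0.5, 0.5])

# Difficulties equal to their group means versus one item far from its group.
near = hirt_log_posterior(theta, mu[group_of_item], mu, students, items,
                          correct, group_of_item, sigma2=0.125, tau2=0.5)
far = hirt_log_posterior(theta, np.array([-0.5, 1.5, 0.5]), mu, students,
                         items, correct, group_of_item, sigma2=0.125, tau2=0.5)
```

A small `sigma2` penalizes items that stray from their group mean, which is how information is shared among items in the same group.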


2.3 Temporal IRT (TIRT) 

1PO IRT and HIRT assume each student’s knowledge state
remains constant over time. However, in a setting where a
student may be acquiring (or forgetting) knowledge over a
period of time (e.g., while interacting with a tutoring system),
we can extend this model by modeling each $\theta_s$ as a
stochastic process varying over time (see for example [5]).
We adopt the approach described in [3], modeling the student’s
knowledge as a Wiener process:

$$
P(\theta_{s,t+\tau} \mid \theta_{s,t}) \propto \exp\left( -\frac{(\theta_{s,t+\tau} - \theta_{s,t})^2}{2 \gamma^2 \tau} \right) \quad \forall s, t, \tau. \tag{3}
$$

In other words, the change in student $s$’s knowledge state
between time $t$ and a future time $t + \tau$ (expressed as
$\theta_{s,t+\tau} - \theta_{s,t}$) is normally distributed about 0 with variance $\gamma^2 \tau$,
where $\gamma$ is a parameter controlling the “smoothness” with
which the knowledge state varies over time.


2.3.1 Learning 

We fit the parameters according to the procedure described
in [3]. Estimating the entire trajectory $\theta_{s,t}$ for each student
simultaneously with item parameters is very expensive and
difficult to do in real-time. To simplify the approach, we
learn parameters in two stages:


1. We learn the $\beta_i$ according to a standard 1PO IRT
model (see Section 2.1.1) on the training student population
and freeze these during validation.


2. For each response of each student in the held-out vali- 
dation population, we predict this response according 
to a temporal IRT model given the student’s previous 
responses, as described below. For further details of 
the validation procedure, see Section 4. 


For the second step, we combine the approximation

$$
P\left(\{(s',i,r,t') \in D : s' = s,\ t' < t\} \mid \theta_{s,t}\right) \approx \prod_{(s',i,r,t') \in D :\, s' = s,\ t' < t} P\left((s',i,r,t') \mid \theta_{s,t}\right) \tag{4}
$$

with (3), integrating out previous proficiencies of the student
to get a tractable approximation of the log posterior over the
student’s current proficiency given previous responses:

$$
\log P(\theta_{s,t} \mid D) \approx \sum_{(s',i,r,t') \in D :\, s' = s,\ t' < t} \left[ r \log f\left(a_{t'} (\theta_{s,t} - \beta_i)\right) + (1 - r) \log\left(1 - f\left(a_{t'} (\theta_{s,t} - \beta_i)\right)\right) \right], \tag{5}
$$

where $a_{t'} = \left(1 + \gamma^2 (t - t')\right)^{-1/2}$. The $a_{t'}$’s are essentially
discounting the relative effect of older responses when estimating
the current proficiency. See [3] for details.
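The discounting can be sketched in a few lines, assuming the ogive link and a discount of the form $(1 + \gamma^2 (t - t'))^{-1/2}$. The sketch first checks numerically the standard identity that averaging $\Phi(x + \epsilon)$ over Gaussian noise $\epsilon$ with variance $v$ gives $\Phi(x / \sqrt{1 + v})$, which is what makes the integration closed-form, and then estimates a current proficiency from a made-up history. The function names, the search bounds, and the toy history are assumptions for illustration.

```python
import math
from scipy.integrate import quad
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def discount(gamma2, t, t_prev):
    """Discount applied to a response made at time t_prev when estimating at time t."""
    return 1.0 / math.sqrt(1.0 + gamma2 * (t - t_prev))

def neg_log_posterior(theta_t, history, t, gamma2):
    """Negative discounted log likelihood; history holds (beta_i, r, t') with t' < t."""
    total = 0.0
    for beta, r, t_prev in history:
        z = discount(gamma2, t, t_prev) * (theta_t - beta)
        total += norm.logcdf(z) if r == 1 else norm.logcdf(-z)
    return -total

def estimate_theta(history, t, gamma2):
    """Maximize over the current proficiency within assumed bounds [-5, 5]."""
    res = minimize_scalar(neg_log_posterior, bounds=(-5.0, 5.0),
                          args=(history, t, gamma2), method="bounded")
    return res.x

# Numerical check of the Gaussian-probit identity behind the discount factor.
x, v = 0.7, 2.0
averaged, _ = quad(lambda e: norm.cdf(x + e) * norm.pdf(e, scale=math.sqrt(v)),
                   -math.inf, math.inf)
closed_form = norm.cdf(x / math.sqrt(1.0 + v))

# Hypothetical history on items of difficulty 0: two early errors, two recent successes.
history = [(0.0, 0, 1), (0.0, 0, 2), (0.0, 1, 8), (0.0, 1, 9)]
theta_static = estimate_theta(history, t=10, gamma2=0.0)
theta_temporal = estimate_theta(history, t=10, gamma2=1.0)
```

With $\gamma^2 = 0$ all four responses count equally and the estimate sits at zero; with $\gamma^2 > 0$ the recent successes dominate and the estimate moves upward.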


2.4 Deep Knowledge Tracing (DKT) 


Recently, a recurrent neural network was used to predict 
student responses [16]. Such architectures have seen enor- 
mous success in applications to a wide range of other do- 
mains (e.g., image processing [6], speech recognition [7], and 
natural language processing [20]). 


In this model, the input vectors are representations of whether
the student answered a particular question correctly or incorrectly
at the previous time step, and the output vectors
are representations of the probability, over all the questions
in the question bank, that a student will get the question correct
at the following time step. In [16], the authors propose
using a one-hot vector $x_{s,t} \in \mathbb{R}^{2I}$ to represent the response of
a student $s$ (on item $i$) at time $t$. Here $I$ is the total number
of items and the first $I$ slots represent answering correctly
and the remaining $I$ slots represent answering incorrectly.
Output vectors $y_{s,t} \in \mathbb{R}^{I}$ are vectors of probabilities, where
the $i$th element of $y_{s,t}$ is the model’s predicted probability
that student $s$ would answer item $i$ correctly at time $t + 1$.
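The input encoding can be sketched as follows, assuming zero-based item indices. The function name is hypothetical.

```python
import numpy as np

def encode_response(item: int, correct: bool, n_items: int) -> np.ndarray:
    """One-hot vector of length 2 * n_items.

    Slot `item` marks a correct answer; slot `n_items + item` marks an incorrect one.
    """
    x = np.zeros(2 * n_items)
    x[item if correct else n_items + item] = 1.0
    return x

# Example with 5 items: item 2 answered correctly, then incorrectly.
x_right = encode_response(2, True, 5)
x_wrong = encode_response(2, False, 5)
```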


We use a model with one hidden layer, of dimension H, 
which is fully connected⁵ to both the input and output layers,
as well as recurrently to itself. This model is able to
capture temporal effects (via the recurrent component of the 
network) and remains flexible enough to describe non-trivial 
relationships between items. 


⁵Note that in [16], an LSTM network was used in addition
to the RNN described here, and the performance of the two 
networks was comparable. 
2.4.1 Learning and Parameter Choices

In order to make learning tractable, we reduced the dimensionality
of the input by projecting the $x_{s,t} \in \mathbb{R}^{2I}$ to a lower
dimensional space $\mathbb{R}^{C}$ using a random projection matrix
$c : \mathbb{R}^{2I} \to \mathbb{R}^{C}$, as was done in [16]. We used batch gradient
ascent with dropout [18], and chose the input dimensionality
$C$ and the hidden dimensionality $H$ by sweeping these
parameters on a data set that was held out from the data
used for training and cross-validation.
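A random projection of a one-hot input reduces to selecting one column of the projection matrix, as the sketch below shows. The Gaussian entries, their scaling, the seed, and the item count are assumptions made for this example; the text does not specify how the random matrix was drawn.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, C = 1000, 50  # hypothetical item count and compressed dimension

# Assumed construction: i.i.d. Gaussian entries scaled by 1 / sqrt(C).
projection = rng.standard_normal((C, 2 * n_items)) / np.sqrt(C)

def compress(item: int, correct: bool) -> np.ndarray:
    """Projection of the one-hot response vector, computed as a column lookup."""
    column = item if correct else n_items + item
    return projection[:, column]

# The lookup matches an explicit matrix-vector product with the one-hot vector.
one_hot = np.zeros(2 * n_items)
one_hot[n_items + 3] = 1.0
explicit = projection @ one_hot
```

The column lookup avoids ever materializing the $2I$-dimensional input, which matters when the item bank is large.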


The predictions are given by the following equations:

$$
h_{s,t+1} = g\left(W_{hh} h_{s,t} + W_{xh}\, c(x_{s,t}) + b_h\right) \tag{6}
$$

$$
y_{s,t+1} = \phi\left(W_{hy} h_{s,t+1} + b_y\right) \tag{7}
$$

Here, $g$ and $\phi$ are the logistic and arctangent functions, respectively.
The parameters of the model $W_{hh}, W_{xh}, W_{hy}, b_h, b_y$
are fit by optimizing the cross-entropy of the responses with
the predicted probabilities (which is equivalent to the log
likelihood if these probabilities were produced via a generative
probabilistic model):

$$
\sum_{(s,i,r,t) \in D} r \log y_{s,t}^{(i)} + (1 - r) \log\left(1 - y_{s,t}^{(i)}\right) \tag{8}
$$

where $y_{s,t}^{(i)}$ is the $i$th element of $y_{s,t}$.
Stochastic gradient ascent with minibatches of students on
the unrolled RNN, coded using Theano [1], was used to optimize
this objective function.


3. DATA SETS 


In order to test these models, we used three data sets, two 
publicly accessible and one proprietary. Each of these data 
sets comes from a system in which students interact with a 
computer-based learning system in a variety of educational 
settings (e.g., interspersed with classroom lectures, offline 
work, etc.). 


3.1 ASSISTments 


This data set comes from the ASSISTments product, an 
online platform which engages students with formative as- 
sessments replete with scaffolded hints. Most assessments 
are templated, and each problem is aligned with one, sev- 
eral, or none of the skills that the product is attempting to 
teach. 


The data set [4] is divided in two parts, the “skill builder” 
set associated with formative assessment and the “non skill 
builder” set associated with summative assessment. All of 
our results are reported on the “skill builder” data set as 
we expect a stronger temporal signal from formative assess- 
ment than from summative assessment. This was also the 
evaluation data set for [16]. 


In preprocessing the data, we associated items not aligned 
with a skill to a designated “dummy” skill, as was done 
in [16]. We chose to discard rows duplicating a single in- 
teraction (represented by a unique order_id value), a step 
we do not believe was taken by [16]. These duplicate rows 
arise when a single interaction is aligned with multiple skills. 
Without removing these duplicates, models that process all
skills simultaneously, including DKT and the IRT variants
used in this paper, will see the same student interaction
several times in a row, essentially providing these models
with access to the ground truth when making their predictions.
This can artificially boost prediction results by a significant
amount (see Section 5), as these “duplicate” rows account
for approximately 25% of the rows. Indeed, we observed
that the performance gains of DKT are negated when these
duplicates are removed (see Section 5). Note that typical
BKT-based approaches are not susceptible to this artificial
boost, since they usually split the data by skill and train
separate models.

Figure 1: Summary of results across models and metrics. Error bars represent the standard error of measure of the metric
across five folds. For TIRT, parameter selection yielded $\gamma^2 = 0.01$ for ASSISTments, $\gamma^2 = 0$ for KDD (making it identical
to IRT), and $\gamma^2 = 100.0$ for Knewton. For HIRT, parameter selection yielded $\sigma^2 = 0.125$ and $\tau^2 = 0.5$ for ASSISTments,
$\sigma^2 = 0.5$ and $\tau^2 = 0.25$ for KDD, and $\sigma^2 = 0.25$ and $\tau^2 = 0.125$ for Knewton. For DKT, $C = 50$, $H = 100$, and the probability
of dropout is 0.25 for all models.
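The de-duplication step for ASSISTments can be sketched as below: keep one row per `order_id`, preserving time order. Only `order_id` comes from the text; the other field names and the sample rows are hypothetical.

```python
def drop_duplicate_interactions(rows):
    """Keep the first row for each order_id, preserving the original order."""
    seen = set()
    kept = []
    for row in rows:
        if row["order_id"] not in seen:
            seen.add(row["order_id"])
            kept.append(row)
    return kept

# Hypothetical rows: interaction 1 is listed twice because it maps to two skills.
rows = [
    {"order_id": 1, "user_id": "u1", "skill": "addition", "correct": 1},
    {"order_id": 1, "user_id": "u1", "skill": "fractions", "correct": 1},
    {"order_id": 2, "user_id": "u1", "skill": "addition", "correct": 0},
]
deduped = drop_duplicate_interactions(rows)
```

Without this step, a sequential model would see the outcome of interaction 1 in the first row and then be asked to predict the same outcome in the second.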


After pre-processing, the data set consisted of 346,740 in- 
teractions for 4,097 users on 26,684 items arising from 815 
templates and 112 skills. The overall percent correct was 
64.54%. 


3.2 KDD Cup 

In 2010, the PSLC DataShop released several data sets derived
from Carnegie Learning’s Cognitive Tutor in (Pre-)Algebra
from the years 2005-2009 [19]. We used the largest
of the “Development” data sets, labeled “Bridge to Algebra 
2006-2007.” 


One distinct difference between Carnegie Learning’s prod- 
uct and ASSISTments is that Carnegie Learning provides 
much finer representations of the concepts assessed by an 
individual item. In particular, Carnegie Learning is built 
around scaffolded, formative assessment, where each step a 
student takes to answer a problem is counted as a separate in- 
teraction, with each step potentially assessing different skills 
(called Knowledge Components (KCs) in the data set). Note 
that this “Problem — Step” structure provides a hierarchy 
which HIRT (Section 2.2) can exploit. 


As in ASSISTments, any particular interaction may assess
zero or more skills. We follow the same methodology as we
did in Section 3.1, arbitrarily but consistently retaining only
one of the skills after preprocessing, and associating items
not associated with any skills with a designated “dummy”
skill.


After pre-processing, the data set retained 3,679,198 inter- 
actions for 1,146 users on 207,856 steps arising from 19,355 
problems and 494 KCs. The overall percent correct was 
88.82%. 


3.3 Knewton

Data was collected from a variety of educational products 
integrated with Knewton’s adaptive learning platform and 
used in various classroom settings across the world. These 
products vary with respect to the educational content used 
(disciplines spanned math, science, and English language 
learning) as well as the way in which students are guided 
through the content. For example, students may take an 
initial assessment and then be remediated on areas need- 
ing improvement. In other products, students start from 
the beginning and work toward a predefined goal set by 
the teacher. In all of these settings, Knewton receives data 
about each interaction (the (s,i,r,t) tuple of Section 2). 
We utilized approximately 1M responses of 6.3K randomly 
sampled students on 105.6K questions spanning roughly 4 
months. Students who worked on fewer than 5 questions 
total were excluded. After pre-processing, student history 
lengths ranged from 5 to 3.2K responses. The overall per- 
cent correct of these responses is 54.6%. 


4. EVALUATION METHODOLOGY 


4.1 Parameter Selection 
For each data set, 20% of students were first set aside for
parameter selection, which we performed as follows:

• For 1PO IRT there were no parameters to select.

• For HIRT, we swept values of the variances $\tau^2$ and $\sigma^2$
of the group means and item difficulties respectively,
including regimes ($\tau^2$ small) which made the model
mathematically equivalent to 1PO IRT.

• For TIRT, we swept the temporal smoothness parameter
$\gamma^2$, including the regime ($\gamma^2$ small) which made
the model mathematically equivalent to 1PO IRT.

• For DKT, we swept the compression dimension $C$ (the
dimension of the space to which the input was projected
using a random matrix), the hidden dimension
$H$, the dropout probability $p$, and the step size of our
gradient ascent.

Figure 2: Accuracy metrics for the three data sets computed using a rolling window of previous responses, as a function of
window length. Response accuracy is computed by predicting correct if the majority of responses in the window are correct.

              IRT          HIRT                        TIRT         DKT*
ASSISTments   problem_id   template_id → problem_id    problem_id   template_id
KDD           Step Name    Problem Name → Step Name    Step Name    KC
Knewton       item_id      concept_id → item_id        item_id      concept_id

Table 1: Item labels yielding best results for each model and data set. For HIRT, the first label specifies the difficulty mean
grouping identifier, and the second the item identifier.


4.2 Online Prediction Accuracy

We use an evaluation method we call online response predic- 
tion which matches that of [16]. Students are first split into 
training and testing populations. Each model is first trained 
on the training population and the model parameters that 
are not student-level (item parameters for IRT-based mod- 
els, weights for neural networks) are frozen. Then for each 
time t > 1 in each testing student’s history, we train the 
student-level parameters in the model on the first t — 1 in- 
teractions of the student history and allow it to compute 
the probability that the t’th response is correct. This process
mirrors the practical task that must be completed by
an intelligent tutoring system (ITS).
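The online protocol can be sketched as a loop over one held-out student's history. The callbacks `fit_student` and `predict` are hypothetical placeholders for whatever student-level fitting and prediction a given model provides; the smoothed running average used here is only a stand-in, not one of the models compared in this paper.

```python
def online_predictions(history, fit_student, predict):
    """history: time-ordered list of (item, correct) for one held-out student.

    Returns (predicted probability, observed outcome) for every response after the first.
    """
    results = []
    for t in range(1, len(history)):
        state = fit_student(history[:t])  # only responses before the target one
        item, correct = history[t]
        results.append((predict(state, item), correct))
    return results

# Stand-in student model: smoothed fraction of correct responses so far.
def fit_running_mean(past):
    return (sum(r for _, r in past) + 1) / (len(past) + 2)

def predict_running_mean(state, item):
    return state  # ignores the item; item parameters would be frozen in a real model

history = [(0, 1), (1, 1), (2, 0)]
preds = online_predictions(history, fit_running_mean, predict_running_mean)
```

The key property is that the prediction for a response never uses that response or any later one.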


We report two different metrics for comparing the predicted 
correctness probabilities with the observed correctness val- 
ues. Accuracy (Acc) is computed as the percent of responses 
in which the correctness coincides with the probability being 
greater than 50%. AUC is the Area Under the ROC Curve 
of the probability of correctness for each response. 
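Both metrics can be computed directly from the predicted probabilities and observed outcomes, as in the sketch below. The AUC is computed through its pairwise interpretation (the probability that a randomly chosen correct response receives a higher score than a randomly chosen incorrect one, with ties counted as one half); this quadratic-time form is chosen for clarity and is an illustration rather than the implementation used for the reported numbers.

```python
def accuracy(probs, outcomes):
    """Fraction of responses where (probability > 0.5) matches the observed correctness."""
    hits = sum((p > 0.5) == bool(r) for p, r in zip(probs, outcomes))
    return hits / len(outcomes)

def auc(probs, outcomes):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [p for p, r in zip(probs, outcomes) if r == 1]
    neg = [p for p, r in zip(probs, outcomes) if r == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predictions and outcomes.
probs = [0.9, 0.8, 0.4, 0.3]
outcomes = [1, 0, 1, 0]
acc_value = accuracy(probs, outcomes)
auc_value = auc(probs, outcomes)
```

Accuracy depends on the 50% threshold, whereas AUC depends only on the ranking of the predicted probabilities.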


We use five-fold cross validation (by partitioning the students)
on the 80% of the data set remaining after parameter
selection (Section 4.1), averaging the Acc and AUC metrics
over five different splits of the student population.
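A student-level split along these lines can be sketched as follows. The function name, the seed, and the round-robin fold assignment are assumptions for illustration; the essential point is that all responses of a student fall in exactly one partition.

```python
import random

def split_students(student_ids, holdout_frac=0.2, n_folds=5, seed=0):
    """Set aside a fraction of students for parameter selection, then fold the rest."""
    ids = sorted(set(student_ids))
    random.Random(seed).shuffle(ids)
    n_holdout = int(round(holdout_frac * len(ids)))
    holdout, rest = ids[:n_holdout], ids[n_holdout:]
    folds = [rest[k::n_folds] for k in range(n_folds)]
    return holdout, folds

# Hypothetical population of 100 students.
holdout, folds = split_students(range(100))
```

Each fold then serves once as the testing population while the other four are used for training.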


5. RESULTS AND DISCUSSION 


Table 1 enumerates the fields chosen in each data set to iden- 
tify items and item groups (for HIRT only) that yielded the 
computationally tractable model with the best results. Note 
that for the IRT-based models, our validation scheme (Section
4.2) estimates a single number $\theta_{s,t}$ for each student at
each point $t > 1$ of the validation. For computational reasons,
it was not feasible to evaluate DKT on fine-grained labels in 
KDD and Knewton (for ASSISTments, fine-grained labels 
were tractable but yielded worse results), whereas all IRT 
variants were able to process data at the finest levels. 


We trained and validated each of the four models on each
of the three data sets as described in Sec. 4. The results on
our evaluation task are summarized in Figure 1. The results
clearly indicate that simple IRT-based models do as well as or
significantly better than DKT across all data sets.


The fact that HIRT is the best-performing model across the 
board (except for MAP accuracy on the Knewton dataset 
where TIRT slightly outperforms it) suggests that grouping 
structure is useful information to exploit when predicting 
student responses. Indeed, the HIRT model does have access 
to strictly more information than the other models in that it 
has both the item and group identifier associated with each 
interaction. While the DKT model does have the ability 
to infer item relationships from data, our results indicate 
that building in this knowledge is more advantageous in a 
variety of educational settings. One potential area to explore 
is in learning a hierarchical model purely from the data, 
which could profit from the structured Bayesian framework 
without requiring prior information or expert labels. 


The temporal IRT model yielded higher accuracy on the 
Knewton dataset, but not on the other two data sets. To 
understand these effects, we investigated the degree to which 
temporal structure in the data affects predictive performance 
by looking at how a naive “windowed percent correct” (predict
the student will answer the tth question correctly if they
answer at least half of the previous w questions correctly) 
model performs as a function of window length w (Figure 2). 
The Knewton data set has a clear optimal window length — 
integrating over windows too short or too long degraded per- 
formance, which is indicative of nontrivial temporal struc- 
ture. However, for the ASSISTments and KDD data sets, 
longer window lengths perform as well as or better than shorter
window lengths, suggesting that static models would do just
as well in these cases. Indeed, this would explain why TIRT
performs about the same as baseline 1PO IRT on ASSISTments
and KDD but shows significant improvement on the
Knewton data set. However, it does not explain why DKT 
lags regardless of the amount of temporal structure. 
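The windowed baseline is simple enough to state in code. The sketch below follows the "at least half of the previous w questions" rule from the text and uses a shorter window at the start of a history, which is an assumption; the response sequence is made up.

```python
def windowed_predictions(responses, w):
    """Predict correct at step t if at least half of the previous w responses were correct."""
    preds = []
    for t in range(1, len(responses)):
        window = responses[max(0, t - w):t]
        preds.append(int(2 * sum(window) >= len(window)))
    return preds

def windowed_accuracy(responses, w):
    """Fraction of responses (after the first) that the windowed rule predicts correctly."""
    preds = windowed_predictions(responses, w)
    return sum(p == r for p, r in zip(preds, responses[1:])) / len(preds)

# Hypothetical response sequence for one student.
responses = [1, 1, 0, 0, 0, 1]
preds_w2 = windowed_predictions(responses, w=2)
acc_w2 = windowed_accuracy(responses, w=2)
```

Sweeping `w` and plotting the resulting accuracy gives a quick diagnostic of how much recent history matters relative to a student's overall average.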


Finally, we note that our DKT results in Figure 1 contradict 
those of [16] on the ASSISTments data set, which reported 
an AUC of 0.86. We believe this is due to data cleaning 
issues, specifically the issue of removing duplicates so as not 
to artificially boost online prediction accuracy, as discussed 
in Section 3.1. Indeed, we were able to reproduce the per- 
formance reported in [16] when applying our RNN imple- 
mentation on the raw data set (with duplicates left in). 


Other recent work [9] points out that the specific method 
of computing AUC in [16] also significantly affects the re- 
ported performance relative to BKT-based models, and fur- 
ther demonstrates that BKT-based models can perform just 
as well as DKT on a variety of data sets. 


6. CONCLUSION 


Our results indicate that simple IRT-based models equal or 
outperform DKT on a variety of data sets, suggesting that 
incorporating domain knowledge into structured Bayesian 
models is a promising area of future research for
modeling student interaction data. 


In our experience, structured models were easier to train 
and required less parameter tuning than DKT. Moreover, 
the computational demands of DKT hampered our ability 
to fully explore the parameter space, and we found that 
computation time and memory load were prohibitive when 
training on tens of thousands of items. These issues could 
not be mitigated by reducing dimensionality without signif- 
icantly impairing performance. Further work on discrimina- 
tive models is necessary to bridge this gap, but currently, 
IRT-based models seem superior both in terms of perfor- 
mance and ease of use, making them suitable candidates 
for real-world applications (e.g. intelligent tutoring systems, 
recommendation systems, or student analytics). 


A promising avenue of research could explore combining 
the advantages of structured Bayesian models with those 
of large-scale discriminative models, which have provided 
superior performance in several other domains, particularly 
in the large-data regime. A crucial challenge for structured 
models is how to accommodate the diversity of educational 
settings from which the data are collected (different content, 
different classroom environments, etc.) while retaining the 
structure that drives predictive power and interpretability. 